Method and system for extracting web data

ABSTRACT

An apparatus for providing an analysis of attitudes expressed in web content, comprising: a collector for collecting attitude-data in relation to a predetermined subject from one or more pre-selected web sites, the attitude-data containing attitudes in relation to the predetermined subject; a processor, associated with the collector, for processing the attitude-data so as to generate an attitude analysis; and an outputter, associated with the processor, for outputting the attitude analysis, thereby to provide an indication of attitudes being expressed in the web content in relation to the predetermined subject.

RELATED APPLICATIONS

The present application claims priority from U.S. Provisional Patent Application No. 60/705,442, filed on Aug. 5, 2005, the contents of which are hereby incorporated by reference.

FIELD AND BACKGROUND OF THE INVENTION

The present invention relates generally to an apparatus and method for public attitude analysis. More particularly but not exclusively, the present invention relates to an apparatus and a method for extracting and analyzing public attitude relevant data.

Modern organizations spend billions of dollars on Public Relations (PR) and advertisement campaigns in order to bring to the public a message, create a positive atmosphere, and influence stakeholders, opinion leaders and customers.

However, measuring the impact imposed by such campaigns on the public is very difficult.

Traditional methods for measuring or predicting the impact imposed by public campaigns on the public are inherently limited.

For example, consumer marketing research includes both attitudinal and behavioral market research. Consumer marketing research generally refers to the study of consumers and their purchasing habits and activities.

Attitudinal research generally includes studies that focus on understanding consumers and how consumers make purchasing decisions. Attitudinal research can be defined as research that represents a person's ideas, convictions or liking with respect to a specific object or idea. Opinions are essentially expressions of attitudes. Consequently, attitudes and opinions can be used almost interchangeably to represent a person's ideas, convictions or liking with respect to a specific object or idea. Collecting consumer purchasing information allows, for example, product manufacturers to drill down to human purchasing dispositions. Attitudinal research may assist in determining the likelihood of product purchase, how future products can be improved, whether product changes are acceptable, etc.

Behavioral research can be defined as the study of consumer behavior. Behavioral research studies what people do, that is, how people act.

Behavioral data, reflecting what consumers actually purchase in the marketplace, as opposed to what researchers infer consumers will or will not purchase, has always been available. However, comprehensive behavioral data is not always easy to capture for a variety of reasons.

The field of consumer marketing research, which includes attitudinal and behavioral market research, requires gathering data related to, for example, consumer attitudes and consumer behavior, in order to analyze such attitudes and behavior. Consumer data may be gathered through the distribution of incentive items activated via participation in consumer research programs and consumer surveys, such as the ones described in U.S. Patent Publication 20030070338, entitled: “Removable label and incentive item to facilitate collecting consumer data”. However, incentive based methods may produce biased results.

Prior art methods for measuring public attitudes include conducting polls on a presumably representative sample of target audiences. For example, U.S. Pat. No. 3,950,618 entitled: “System for Public Opinion research” describes an automatic system for processing a public opinion poll. However, such methods are based on an assumption that such samples are indeed representative of the target audiences.

Another popular prior art method for evaluating public attitudes, which is very often employed, involves focus group techniques. A focus group is a group of people, presumed to be representative of a target population, such as parents or customers, gathered to provide answers to open-ended questions on specific topics and share their opinions.

The prior art lacks methods for capturing public attitudes which do not rely on the careful selection of a representative sample or the actual behavior and the availability of comprehensive data pertaining to the actual behavior.

Prior art has so far failed to incorporate public attitude spread by word of mouth, specifically as far as the Internet is concerned. The web added a new dimension to the media mix: online news groups, discussion groups, forums, chats and blogs are all forms of communication that did not exist ten years ago, and today they are an inseparable part of the media mix. The public is an inseparable part of the media. The public is fed from the media and feeds the media through its new means of communication.

There is thus a widely recognized need for, and it would be highly advantageous to have, an apparatus and method for extracting and analyzing public attitude data which is devoid of the above limitations.

SUMMARY OF THE INVENTION

According to one aspect of the present invention there is provided an apparatus for providing an analysis of attitudes expressed in web sites, comprising: a collector for collecting attitude-data in relation to a predetermined subject from at least one pre-selected web site, the attitude-data containing attitudes in relation to the predetermined subject, a processor, associated with the collector, for processing the attitude data so as to generate an attitude analysis, and an outputter, associated with the processor, for outputting the attitude analysis, thereby to provide an indication of attitudes being expressed in the web content in relation to the predetermined subject.

According to a second aspect of the present invention there is provided an apparatus for crawling web content to provide data for attitude analysis of attitudes expressed in the web content in relation to a predetermined subject, the apparatus comprising a crawler, configured to crawl a plurality of pre-selected web sites, for collecting attitude-data from the web sites, the attitude data comprising attitudes relating to the predetermined subject, the crawler being further configured to provide the attitude data to a predetermined location for the attitude analysis.

According to a third aspect of the present invention there is provided a method for analyzing attitudes expressed in web content, the attitudes being in relation to a predetermined subject, comprising: automatically collecting attitude data from at least one pre-selected web site, the attitude-data expressing a plurality of attitudes in relation to the predetermined subject, electronically processing the attitude data so as to generate attitude information indicative of the plurality of attitudes, and outputting the attitude-information, thereby to provide an analysis of the attitudes in relation to the predetermined subject.

According to a fourth aspect of the present invention there is provided a device for interactive setting of a data collection policy using a web page display, comprising a web page displayer, for displaying a web page to a user, operable for defining a data collection policy in relation to the web page. Preferably, the device's web page displayer is further operable to define a specific data collection policy in relation to a respective region of the web page.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The materials, methods, and examples provided herein are illustrative only and not intended to be limiting.

Implementation of the method and system of the present invention involves performing or completing certain selected tasks or steps manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of preferred embodiments of the method and system of the present invention, several selected steps could be implemented by hardware or by software on any operating system of any firmware or a combination thereof. For example, as hardware, selected steps of the invention could be implemented as a chip or a circuit. As software, selected steps of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In any case, selected steps of the method and system of the invention could be described as being performed by a data processor, such as a computing platform for executing a plurality of instructions.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in order to provide what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.

In the drawings:

FIG. 1 is a block diagram of an apparatus for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention;

FIG. 2 is a detailed block diagram of an apparatus for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention;

FIG. 3 is an exemplary main forum web page;

FIG. 4 is an exemplary forum header web page;

FIG. 5 shows an exemplary message header page;

FIG. 6 is a flow chart illustrating an implementation of a predefined collecting policy for a specific forum web site, according to a preferred embodiment of the present invention;

FIG. 7 shows an exemplary XML format parsed attitude-data bearing page representation, according to a preferred embodiment of the present invention;

FIG. 8 is a block diagram illustrating an apparatus for collecting attitude-data from web site(s) according to a preferred embodiment of the present invention;

FIG. 9 shows an exemplary collecting policy definer graphical user interface (GUI), according to a preferred embodiment of the present invention;

FIG. 10 shows an exemplary Web site page;

FIG. 11 shows an exemplary user marked Web site page, according to a preferred embodiment of the present invention;

FIG. 12 shows an exemplary relative title position encoding in a change query language script according to a preferred embodiment of the present invention;

FIG. 13 shows an exemplary pseudo-code, for crawling a specific web site page, according to a preferred embodiment of the present invention;

FIG. 14 is a flowchart illustrating attitude data processing according to a preferred embodiment of the present invention;

FIG. 15 shows an exemplary graphic representation of the results of clustering, according to a preferred embodiment of the present invention;

FIG. 16 shows an exemplary graphic representation of the results of correlation measurement according to a preferred embodiment of the present invention;

FIG. 17 shows a first graphic representation of attitude-data analysis according to a preferred embodiment of the present invention;

FIG. 18 shows a second exemplary graphic representation of attitude-data analysis according to a preferred embodiment of the present invention;

FIG. 19 is a flow diagram of an exemplary method for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention;

FIG. 20 is a flow diagram of an exemplary method for categorizing attitude-data text according to a preferred embodiment of the present invention;

FIG. 21 is an exemplary pseudo-code algorithm for clustering concepts relating to attitude-data, according to a preferred embodiment of the present invention; and

FIG. 22 is a simplified block diagram of an exemplary architecture of an apparatus for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present embodiments comprise apparatus and method for analyzing public attitudes expressed in web sites or any kind of electronic information found on the web, in a holistic approach.

The embodiments, according to the present invention, are based on collecting information found on the web, generally considered an influential medium, where authentic attitudes are expressed daily. The websites are, in effect, today's word of mouth (WOM) as communicated by millions of Internet users.

Millions of web users express their views and feelings in online news groups, discussion groups, forums, chat sites, internet blogs etc. All these new means of communication, intensively used by the public today, have become a major part of the media where people are exposed to ideas, products, and messages and where people express their attitudes.

Embodiments of the present invention aim at collecting the immense amount of high value authentic data pertaining to people's attitudes found in Web sites and holistically analyzing the data, so as to provide high value attitude information.

The principles and operation of an apparatus and a method according to the present invention may be better understood with reference to the drawings and accompanying description.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.

Reference is now made to FIG. 1, which is a block diagram of an apparatus for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention.

An apparatus 1000 according to a preferred embodiment of the present invention comprises a collector 110. The collector 110 is configured for collecting data, including but not limited to attitude data, containing attitude expression, from pre-selected web site(s) 1100, the attitude data relating to a predefined subject. Preferably, the number of the pre-selected web sites 1100 may reach hundreds of thousands of web sites.

The pre-selected web sites typically include Chat sites, Interactive news groups, Discussion groups, Forums, Blogs and the like where people express their views and feelings. For example: Internet users may express their views regarding a proposed tax reform, to be discussed by a government, regarding a new product etc.

According to a preferred embodiment, the collector 110 is programmed as a crawler in a spider network, arranged to detect new attitude data in the pre-selected web sites. For tracking the new attitude data added to a pre-selected web site, the collector 110 utilizes a script, written in a change detection language, as described in greater detail herein below.

For example, the script may define which parts of a specific page of a pre-selected web site bear a fixed content such as a logo of a firm operating the site, and which parts contain dynamic content, bearing attitude data, such as a continuous flow of users' messages in a web site's chat room.

In another example, the script may define a comparison to be made by the collector 110 between current content of a page or a part of a page and attitude data previously downloaded from the same page or part of the page.
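By way of a non-limiting illustration only, the comparison between current content and previously downloaded attitude data might be sketched as follows. The sketch assumes that each message is available as plain text and that a SHA-256 fingerprint of the whitespace-normalized text is an adequate identity test; the function names `fingerprint` and `detect_new_messages` are hypothetical and are not part of the change detection language described herein.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable fingerprint of a message (assumption: whitespace is insignificant)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_new_messages(current_messages, seen_fingerprints):
    """Return only the messages not seen on a previous download.

    `seen_fingerprints` is a set that is updated in place, standing in
    for whatever record of earlier downloads the collector keeps.
    """
    new_messages = []
    for message in current_messages:
        fp = fingerprint(message)
        if fp not in seen_fingerprints:
            seen_fingerprints.add(fp)
            new_messages.append(message)
    return new_messages
```

On a second visit to the same page, only messages whose fingerprints are absent from the set are returned, so unchanged content is not processed again.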

The script may be generated using a collecting policy definer 160, as described in greater detail herein below, and illustrated using FIGS. 9 and 10.

The apparatus 1000, according to a preferred embodiment, further comprises a processor 120, associated with the collector 110, used for processing the attitude-data. The processing of the attitude-data may typically include parsing the attitude-data, content analysis techniques, data mining, and other data analysis techniques. These techniques may implement any one of a variety of algorithms, including but not limited to: neural networks, rule reduction, decision trees, pattern analysis, text and linguistic analysis techniques, or any relevant algorithm known in the art.

The apparatus 1000 according to a preferred embodiment, further comprises an outputter 130, associated with the processor, for outputting resultant attitude information based on the processed attitude-data. Preferably, the output information is presented to a user utilizing a set of graphical tools, as described in greater detail herein below. The graphical tools may be implemented as a stand alone desktop application, as a web browser based application, as a client application in a client-server architecture, etc.

Preferably, the apparatus further comprises a data-storage 150 where the attitude-data is stored.

More preferably the data storage 150 is a data warehouse, provided with a storage area and, preferably, with advanced means for analysis of the attitude-data. In a preferred embodiment, the data warehouse is provided with a set of graphical tools aimed at enabling a user to navigate the processed attitude-data, explore it, and easily find the information the user is interested in.

The graphical tools may be implemented as a desktop application, a web application, or any other alternative known in the art.

In a preferred embodiment, the collector 110 may continuously monitor the pre-selected web site(s) 1100, on a 24 hours a day and seven days a week basis.

Optionally, a specific schedule for collecting the attitude-data may be set with respect to specific web site(s).

According to a preferred embodiment of the present invention, the collector 110 works in a continuous mode. Preferably, the collector utilizes a change detection language or mechanism, and downloads relevant pages of the pre-selected web site(s), according to a predefined collecting policy.

According to a preferred embodiment of the present invention, the collector 110 further includes a crawler.

The crawler is responsible for crawling the pre-selected web pages for new data, and for downloading relevant web pages therefrom. Preferably, the crawler is an open system that has capabilities to download all kinds of data on the network including, but not limited to: Web pages, Forums, Discussion boards, and Blogs.

Reference is now made to FIG. 2 which is a detailed block diagram of an apparatus for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention.

An apparatus 2000, according to a preferred embodiment of the present invention comprises a GUI Manager 210 which manages the interaction with a user of the apparatus 2000.

The GUI Manager 210 includes a Correlation GUI component 201 which is configured to present correlation data pertaining to correlations among phrases having relevance-relationships with a common concept relating to the predefined subject, as found in the attitude data, as described in greater detail herein below.

The Correlation GUI component is connected with a correlator 242 which is configured to measure correlation between one or more phrases and a respective common concept relating to the predefined subject, as described in greater detail herein below. The concept may describe an attitude towards the subject such as, but not limited to, a negative, a positive, or a neutral attitude, including any words that do not express a sentiment directly but may be conceptually related in people's minds.

The Correlation GUI is further connected to a Matrix Creator 241 which is configured to create and populate an N×N Matrix with values indicating distances between correlated phrases, as described in greater detail herein below.
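Purely as an illustrative sketch, an N×N matrix of distances between phrases might be populated as shown below. The distance measure used here (one minus the Jaccard overlap of the sets of messages in which each phrase occurs) is an assumption made for the example only; the present description does not specify the measure at this point, and the function name `build_distance_matrix` is hypothetical.

```python
def build_distance_matrix(phrases, messages):
    """Populate a symmetric N x N matrix of phrase-to-phrase distances.

    Assumed measure: 1 - |A & B| / |A | B|, where A and B are the sets of
    message indices containing each phrase (case-insensitive substring match).
    """
    occurrences = [
        {i for i, message in enumerate(messages) if phrase.lower() in message.lower()}
        for phrase in phrases
    ]
    n = len(phrases)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = occurrences[i] | occurrences[j]
            if union:
                shared = occurrences[i] & occurrences[j]
                distance = 1.0 - len(shared) / len(union)
            else:
                distance = 1.0  # neither phrase occurs: treat as maximally distant
            matrix[i][j] = matrix[j][i] = distance
    return matrix
```

Under this assumed measure, phrases that always appear in the same messages have distance 0, and phrases that never co-occur have distance 1.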

The GUI Manager 210 further includes a Clustering GUI component 202 which presents clusters relating to concepts in the attitude data to the user, as described in greater detail hereinabove. The concept may describe any information regarding the subject, such as: attitude towards the subject such as a negative, a positive, or a neutral attitude, as described hereinabove, or other related concepts, people, products, emotions etc.

Clustering the concepts may be carried out by a clustering engine 260, utilizing clustering methods, as described in greater detail hereinabove.

Preferably, the Correlation GUI component 201 and the Clustering GUI component 202 are further connected with a Projections engine 230 which graphically positions items representing clusters and correlated phrases on the GUI's screens, such as the screen presented in FIG. 15 herein below.

The GUI Manager 210 further includes a Trend GUI component 203 which presents the user with trend data. The trend data is generated by a Trend Analyzer 220 which is configured to detect trends in the attitude data. The trend analyzer is fed by a Statistics component 244 which generates statistical data pertaining to the appearance of attitude expressing phrases in the attitude data, as described in greater detail herein below. For example, the Trend GUI may facilitate the detection of a shift in public discussion of a specific concept, or expression of specific attitudes.

The GUI Manager 210 may further include a Statistics GUI component 205, connected to the Statistics component 244, for presenting the statistics data generated by the Statistics component 244 to the user.

The GUI Manager 210 further includes a Quotations GUI component 204 which presents the user with quotations relating to the concepts presented to the user by the Correlation GUI component 201, as described hereinabove. The Quotations GUI component 204 is fed by a Quotator 243 which is configured to extract relevant attitude expressing quotations from the attitude data.

The Statistics component 244, the Quotator 243, the Correlator 242, and the Matrix creator 241 are connected to a core engine 250 which includes a parser 252, for parsing the attitude data that is downloaded from crawled pages, and a counter 251 for counting the appearances of concepts etc. in the attitude data, as described in greater detail herein below.
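As a minimal sketch of what a counter of this kind might do, the following counts whole-word, case-insensitive appearances of concepts across a collection of messages. The matching rule and the name `count_concepts` are assumptions made for illustration; the description does not define how appearances are matched.

```python
import re
from collections import Counter


def count_concepts(messages, concepts):
    """Count whole-word, case-insensitive appearances of each concept."""
    counts = Counter({concept: 0 for concept in concepts})
    for message in messages:
        for concept in concepts:
            # \b anchors keep "price" from matching inside "priceless".
            pattern = r"\b" + re.escape(concept) + r"\b"
            counts[concept] += len(re.findall(pattern, message, flags=re.IGNORECASE))
    return counts
```

Counts of this kind could then feed the statistics and trend components described above.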

Reference is now made to FIG. 3 which shows an exemplary main forum web page.

The exemplary main forum web page is a DVD Talk forum main web page. In this example, the crawler is preconfigured for crawling the web site, downloading all messages that appear in the threads (topics) of the site forums. In this example, the crawler first crawls the links in the exemplary main forum web page, to all forum header pages 310 available in the pre-selected web site. Preferably, the crawler is further preconfigured to filter out non-relevant links so as to avoid downloading or attempted downloading of irrelevant pages.

Reference is now made to FIG. 4 which shows an exemplary forum header web page.

After the crawler gets the links to the forum header pages from the exemplary main forum page, the crawler crawls relevant threads 410 appearing in each of the header web pages, according to the links 310. Preferably, the crawler is pre-configured for filtering out non-relevant threads, such as the general policy and search threads 411 appearing in this example.
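The filtering of non-relevant links might, for illustration only, be sketched as below. The sketch assumes that relevance can be decided from the URL alone, by keeping same-host links and dropping those whose path contains an excluded keyword; the keywords, the example host name and the function name `select_thread_links` are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def select_thread_links(base_url, hrefs, excluded_keywords=("search", "policy")):
    """Resolve hrefs against base_url and keep only relevant, same-host links."""
    base_host = urlparse(base_url).netloc
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # off-site link, e.g. an advertisement
        if any(keyword in parsed.path.lower() for keyword in excluded_keywords):
            continue  # assumed non-relevant page, e.g. search or policy
        if absolute not in selected:
            selected.append(absolute)  # keep first occurrence, preserve order
    return selected
```

In practice the exclusion rules would be part of the collecting policy defined for each web site rather than fixed keywords.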

Reference is now made to FIG. 5 which shows an exemplary message header page.

According to a preferred embodiment of the present invention, for each of the relevant threads 410, the crawler extracts relevant attitude-data, which contain attitude expressions. As shown in the example page in FIG. 5, each message optionally comprises a date, a title, an author, and a message body.

Optionally, each message also contains a list of quotes (quotations from other cited messages), and a signature. The quotes are marked in the message so that, during the analysis procedure, the user can choose whether the analysis is performed on the messages with or without the quotes. The message signatures (when present) are filtered during the crawling process, in order to avoid skewing the results, as described in greater detail herein below.

The data has to be extracted from the page, while omitting all irrelevant information. It is important to remember that there are many types of irrelevant information that may be found on such message pages. The irrelevant information includes but is not limited to: other messages, signatures, html tags, ads etc., and these vary from one site to another.

Preferably, the collector 110 implements a predefined collecting policy. The collecting policy may include specific guidelines with respect to specific ones of the pre-selected web sites. These guidelines may define which parts of the pre-selected web site(s) to crawl, in what order, etc.

For example, reference is now made to FIG. 6 which is a flow chart illustrating an implementation of a predefined collecting policy for a specific forum web site, according to a preferred embodiment of the present invention.

In a preferred embodiment, the collector 110 uses an HTTP request for downloading the relevant page(s) of the pre-selected web site(s), according to URL addresses.

Preferably, the crawler may be further configured for handling relevant aspects of the crawling such as session objects, login information, cookies, etc.

According to a preferred embodiment, the crawler is further responsible for scheduling downloading processes of relevant pages of the pre-selected web site(s) (i.e. request per time quantum per site).

In addition, the crawler may be also configured for determining in what order the pages are downloaded. Preferably, network traffic is also carefully monitored by the crawler, with respect to the pre-selected web sites, so as to avoid generating excessive traffic on the web site(s), by carefully scheduling the downloading process.

Optionally, the crawler verifies that a pre-defined time interval is kept between one access to a certain web site and another access, so as to avoid creating network overload on the web site.
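One possible way of keeping a pre-defined interval between accesses to the same web site is sketched below, under the assumption that the crawler records the time of its last access per site. The class name `PolitenessGate` and its methods are hypothetical; the clock is injectable so that the behavior can be checked without actually waiting.

```python
import time


class PolitenessGate:
    """Track the last access time per site and report how long to wait."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = float(min_interval_seconds)
        self._clock = clock
        self._last_access = {}

    def seconds_until_allowed(self, site):
        """Return 0.0 if the site may be accessed now, else the remaining wait."""
        last = self._last_access.get(site)
        if last is None:
            return 0.0
        elapsed = self._clock() - last
        return max(0.0, self.min_interval - elapsed)

    def record_access(self, site):
        """Note that the site has just been accessed."""
        self._last_access[site] = self._clock()
```

A crawler loop might call `seconds_until_allowed(site)`, sleep for the returned duration, issue its request, and then call `record_access(site)`.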

According to a preferred embodiment, several crawlers are employed in parallel in the downloading process and each of the crawlers is configured for downloading respective web site(s).

According to a preferred embodiment of the present invention, the collector 110 further includes a parser. Once a relevant page is downloaded by the crawler, it may be forwarded to the parser.

The parser is configured for parsing the relevant page and for extracting relevant attitude data from the relevant page or links to pages that contain this relevant data.

Relevant data sections may be found in the message text, message title, date, author and other places on the page. The parser is further configured for filtering out irrelevant information on the page, like html tags, ads, header, footer etc.

In a preferred embodiment of the present invention, the parser may apply a script, customized specifically by a user for each web site, to extract relevant attitude-data from the web site, while filtering out non-relevant or corrupted data. The non-relevant data may include but is not limited to: hidden data such as html tags and scripts that are mainly used for page definition and page control, and non-relevant content data like texts that are presented on the web page but are not relevant with respect to the attitude-data, such as a page number, a commercial footer, a banner etc.

In a preferred embodiment of the present invention, after irrelevant data is removed and only the relevant attitude-data remains in the web page, the parser converts the web page into a mark-up language format representation. Optionally, the mark-up language is XML. In the mark-up language format representation, relevant data and metadata may be encoded in a searchable and indexable format.

According to a preferred embodiment, specific types of data, found on the web page, are handled by the parser in a specific manner, in accordance with a predefined policy.

For example, in message boards, very often a user issues a new message, citing a message previously posted by another user. The cited message appears in the new message. However, for analysis purposes it may be ignored, as it may skew the statistics of the results if it is counted twice in spite of the fact it is not a new unique message. In another example, message signatures may also skew the results, as they are identical for all messages a specific user issues. During analysis, the words appearing in the signatures may skew the statistics.

Thus the parser may be configured to recognize and ignore parts of messages such as quotations from other messages or signatures.
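As a hedged illustration, ignoring quotations and signatures might look like the sketch below. It assumes two conventions that real sites do not necessarily follow: quoted lines begin with ">" and a signature starts at a line consisting only of "--". In the embodiments described herein, such rules would instead come from the per-site script; the function name `strip_quotes_and_signature` is hypothetical.

```python
def strip_quotes_and_signature(body):
    """Return the message body without quoted lines and without the signature.

    Assumed conventions (illustrative only): quoted lines start with ">",
    and everything from a line equal to "--" onward is the signature.
    """
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":
            break  # signature delimiter: drop the rest of the message
        if line.lstrip().startswith(">"):
            continue  # quoted text from another message
        kept.append(line)
    return "\n".join(kept).strip()
```

Only the author's own text then contributes to phrase counts, so a heavily quoted message or a frequently repeated signature does not inflate the statistics.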

Reference is now made to FIG. 7 which shows an exemplary semi-XML format parsed attitude-data bearing page representation, according to a preferred embodiment of the present invention. In the provided example, a community name (DVD Talk in the example) 710, a forum name (DVD Exchange in the example) 720, a message title 730, a date 740, an author 750 and the body of the message 760 are encoded in a searchable and indexable XML language format representation.
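A representation of this general kind might be produced as in the following sketch, which uses Python's standard `xml.etree.ElementTree` module. The element names (`message`, `community`, `forum`, `title`, `date`, `author`, `body`) are assumptions chosen for the example and are not taken from FIG. 7; the function name `message_to_xml` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def message_to_xml(community, forum, title, date, author, body):
    """Encode one parsed message as an XML string (element names are assumed)."""
    root = ET.Element("message")
    fields = (
        ("community", community),
        ("forum", forum),
        ("title", title),
        ("date", date),
        ("author", author),
        ("body", body),
    )
    for tag, value in fields:
        ET.SubElement(root, tag).text = value
    # Special characters such as "<" and "&" are escaped by the serializer.
    return ET.tostring(root, encoding="unicode")
```

Because each field is a separate element, the stored representation can be searched or indexed by field, for example by author or by date.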

In a preferred embodiment of the present invention, the collector 110 further comprises a data integrator (updater).

The data integrator is responsible for verifying that only relevant pages/documents crawled from the internet are stored in the data storage 150. The data integrator checks that a current document does not already exist in the data storage 150. The data integrator is also responsible for checking the completeness of the download, i.e. that no errors are found, the parsing is carried out successfully, etc.

When the Updater identifies that all pages are downloaded (for example, according to the expected number of pages that should be downloaded), it crawls all the user profiles, and then folds the whole downloaded data set to the data storage 150 or calls another component, say a utility of a database management system (DBMS), for folding the data to the data storage 150.

The data integrator may be configured for integrating the attitude data into complete and non-redundant attitude data. The integration of attitude data by the data integrator may include but is not limited to: handling redundancy of data, preventing duplicate pages from being kept, etc. Integrating the data may further include ensuring complete download of all relevant data bearing pages of the web site(s), i.e. that the attitude-data is error free, that the parsing is successfully completed etc. The data integrator may be further configured for indicating when and if all relevant pages are downloaded.

In a preferred embodiment, the data integrator is further configured for deciding whether an apparent error, detected when downloading a web page, is recoverable or whether the web page should be regarded as corrupted and accordingly ignored.

The data integrator may be further configured for updating the data storage 150 with the attitude-data, while carrying out the integration of the attitude-data as described hereinabove.

According to a preferred embodiment of the present invention, theapparatus 1000 further comprises a collecting policy definer 160 whichis associated with the collector 110 and is used for defining thecollecting policy.

Preferably, the collecting policy may address various aspects of thecollecting process. The collecting policy may define which web site(s)or what kind of web sites the collector collects attitude-data from. Thecollecting policy may provide specific guidelines for crawling through aspecific web site. The specific guidelines may define which kinds ofdata that are found on pages of the web site are to be ignored, in whatorder should the web site be crawled, how the pages are parsed, howdifferent types of data are marked up, etc.

Reference is now made to FIG. 8 which is a block diagram illustrating an apparatus for collecting attitude-data from web site(s) according to a preferred embodiment of the present invention.

An apparatus according to a preferred embodiment of the present invention comprises one or more crawler(s) 801. Each crawler may be assigned to respective pre-selected web site(s) 800 for crawling, to locate and download relevant web pages carrying attitude-data therefrom.

Each of the downloaded web pages is then put in a parsing queue 803, from which, in its turn, the page is parsed by a parser 805. Preferably, the parser 805 is configured to parse a web page and create a mark-up language based representation of the web page. Preferably, the mark-up language is XML. The parser may be further configured for forwarding the parsed page(s) to an update queue 806.

An apparatus according to a preferred embodiment also includes a data integrator (updater) 807. Preferably, the data integrator 807 is configured for fetching the parsed page data from the update queue 806 and integrating the data by handling redundancy of data, preventing duplicate pages from being kept, etc. Integrating the data may further include ensuring complete download of all relevant data-bearing pages of the web site(s) (such as next pages, navigation from forum to topic and then to the message itself, etc.) utilizing a request queue 809, ensuring that the attitude-data is error free, verifying that the parsing is successfully completed, etc. Finally, the data integrator updates a database (DB) or a data warehouse (DW) 810 with the parsed pages carrying the attitude-data.
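The redundancy handling described above may be sketched as follows. This is a minimal illustration only: the class name `DataIntegrator`, the use of a content digest to detect duplicates, and the in-memory list standing in for the DB/DW 810 are all assumptions made for the example, not details given in this description.

```python
import hashlib


class DataIntegrator:
    """Hypothetical sketch: keeps a single copy of each parsed page."""

    def __init__(self):
        self.seen_digests = set()  # digests of pages already stored
        self.store = []            # stands in for the DB/DW 810

    def integrate(self, url, parsed_xml):
        """Store the parsed page unless identical content was stored before.

        Returns True when the page is stored, False when it is a duplicate.
        """
        digest = hashlib.sha256(parsed_xml.encode("utf-8")).hexdigest()
        if digest in self.seen_digests:
            return False
        self.seen_digests.add(digest)
        self.store.append((url, parsed_xml))
        return True


integrator = DataIntegrator()
first = integrator.integrate("http://example.com/a", "<message>Great film</message>")
repeat = integrator.integrate("http://example.com/b", "<message>Great film</message>")
```

Keying on the parsed content rather than on the address is one possible design choice; it treats the same message reached through two navigation paths as a single record.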

A preferred embodiment of the present invention may further include a crawl manager/scheduler 811 which manages the crawler(s) 801 and schedules the crawling of pre-selected web page(s) according to the request queue 809, utilizing a downloads queue 813 to be used by the crawler(s) 801. Preferably, the request queue is managed by a collecting policy definer 160, preferably implemented as a management console 815.

The crawl manager/scheduler 811 is responsible for scheduling the download process (i.e. requests per time quantum per site). In addition, it is responsible for the order in which pages are downloaded.

Network traffic is carefully monitored by the various web sites. To avoid generating excessive traffic on the downloaded web sites, a careful schedule may be implemented for the download process. The Crawl Manager 811 is responsible for the scheduling and verifies that a pre-defined time interval is kept between one access to a certain web site and the next. In that way, the generation of network overload on the crawled web sites by the crawling may be avoided.
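The per-site time interval check may be sketched as follows. The class name `CrawlScheduler`, the method `may_access`, and the use of plain numbers of seconds for time are hypothetical choices made for this illustration.

```python
class CrawlScheduler:
    """Hypothetical sketch: enforces a minimum interval between accesses to a site."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.last_access = {}  # site -> time of the last permitted access

    def may_access(self, site, now):
        """Return True and record the access if the interval has elapsed."""
        last = self.last_access.get(site)
        if last is not None and now - last < self.min_interval:
            return False  # too soon, the request stays queued
        self.last_access[site] = now
        return True


scheduler = CrawlScheduler(min_interval_seconds=30)
decisions = [
    scheduler.may_access("site-a", now=0),   # first access to site-a
    scheduler.may_access("site-a", now=10),  # within the interval
    scheduler.may_access("site-b", now=10),  # a different site
    scheduler.may_access("site-a", now=30),  # interval elapsed
]
```

Passing the current time in explicitly keeps the sketch deterministic; a running system would presumably read a clock instead.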

In addition, employing several crawlers 801 together allows parallelism in the downloading process, downloading from many web sites in parallel while accessing each one only once in a while.

Also, as the ratio of new user post pages (documents) to existing pages is not very high, an updated list of the new post pages may be maintained and used for further reducing crawling activities on the crawled web sites.

Reference is now made to FIG. 9 which shows an exemplary collecting policy definer graphical user interface (GUI), according to a preferred embodiment of the present invention.

According to a preferred embodiment of the present invention, the collecting policy definer 160 includes a graphical user interface (GUI), which graphically facilitates the definition of a collecting policy by a user of the apparatus 1000.

In the exemplary collecting policy GUI of FIG. 9, on the top of the screen there is a settings bar where the user inputs the address of the web site page 910 for collecting attitude-data therefrom, the output file 920 to save the results in, and the page type 930.

Below the settings bar there is a window 950 where the user may provide other definitions, for example: the color coding for each of the marked fields, the date format being used in this particular site (e.g. European, American, or other), and optionally other relevant definitions.

Below the window 950 is the main working area of the application 980. The main working area 980 has the behavior of a browser and loads a web site page so as to allow the user to define the specific collecting policy with regard to the specific web site page.

When the page is loaded the user may mark the relevant parts on the page, indicating what section reflects what part of information to be crawled, or optionally, to be ignored. This operation is preferably repeated for each part of a web site (e.g. forums list, topic list, message pages, author profiles, etc.).

According to a preferred embodiment of the present invention, the collecting policy definer 160 is configured to use the definitions made using the GUI, as described hereinabove, for generating a script encoded collecting policy. The collecting policy may specifically define how each element on the page is crawled or parsed.

Reference is now made to FIG. 10 which shows an exemplary Web site page.

The exemplary page (http://dvdtalk.com/forum/forumdisplay.php?f=8) is a Forum web site page. The exemplary page has several main parts: a header, headlines, banners, a quick launch area for starting frequently used forums, and a list of forums.

Using the GUI of the collecting policy definer 160 the user may graphically select part(s) of the web site page and define a collecting policy for the part(s) as well as for the whole page.

Reference is now made to FIG. 11 which shows an exemplary user marked web site page, according to a preferred embodiment of the present invention.

The web site page of FIG. 10 is now presented having its main parts graphically selected and marked by the user.

With regard to the exemplary page, the user may define that only the elements of the list of forums 1110 are to be crawled and parsed.

According to a preferred embodiment, each element is regarded by the collecting policy definer 160 as having a position relative to a parent element.

In the example of FIG. 11, each element 1111-1112 of the list of forums 1110 has a relative position with respect to the header of the list 1120. Consequently, when the absolute position of the header is changed, say when a new advertisement banner is positioned by an operator of the web site just above the list of forums, the relative position of each element on the list remains the same.

Reference is now made to FIG. 12 which shows an exemplary relative title position encoding in a change query language script according to a preferred embodiment of the present invention.

The provided exemplary position is relative to a header of an HTML table. The table header has a fixed position on the page, and the shown exemplary relative title position is encoded in relation to the fixed position. The parser 805 uses the definition provided by the user, as illustrated in FIGS. 10-11 and explained hereinabove, to correctly encode a generic title position relative to the fixed position of the table header.

Reference is now made to FIG. 13 which shows an exemplary pseudo-code for finding the specific element definition in the collecting policy GUI (FIG. 9) on a specific web site page, according to a preferred embodiment of the present invention.

The provided exemplary pseudo-code describes a sequence of steps for the collecting policy GUI (FIG. 9) to extract from the exemplary web site page that may be a part of the collecting policy, encoded in a script, based on user-provided definitions, as illustrated and explained using FIGS. 10-11 hereinabove.

According to a preferred embodiment of the present invention, the collector 110 is configured for carrying out several steps of processing with regard to the collected attitude-data.

Reference is now made to FIG. 14 which is a flowchart illustrating attitude data processing according to a preferred embodiment of the present invention.

According to a preferred embodiment of the present invention, the attitude data is processed in a pipeline mode, wherein each document/message in the crawled web pages undergoes a series of steps that are applied to it in a row.

According to a preferred embodiment, Internet sites 1400 are crawled for new messages, bearing attitude-data, based on a script in a change query language as described hereinabove.

Preferably, any given web page may be downloaded using the HTTP protocol. However, the page has to be parsed in order to extract its information. This is the role of the parser described above.

The parser may represent the downloaded web page as an XML tree, and apply a change query language script, specifically customized for each web site, to extract the relevant information from it, skipping all the non-relevant information.

For example, the change query language may be Extensible Stylesheet Language Transformations (XSLT), which is a language for transforming XML documents into other XML documents.

The XSLT script may have the ability to ignore all kinds of non-relevant data, based on an ad-hoc customization, as described in greater detail herein below, for the collecting policy definer.
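Extracting only the relevant elements from an XML tree, by a path taken relative to a parent element, may be sketched as follows. Plain path queries from the Python standard library are used here in place of an XSLT script, purely so the example is self-contained. The sample markup, the element names (`page`, `banner`, `table`, `header`, `row`, `title`) and the function name `extract_titles` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML representation of a forum list page; the banner is non-relevant.
PAGE = (
    "<page>"
    "<banner>Advertisement</banner>"
    "<table id='forums'>"
    "<header>Forums</header>"
    "<row><title>DVD Talk</title></row>"
    "<row><title>HD Talk</title></row>"
    "</table>"
    "</page>"
)


def extract_titles(xml_text):
    """Return forum titles, located relative to the forums table element."""
    root = ET.fromstring(xml_text)
    table = root.find(".//table[@id='forums']")
    # The path below is relative to the table, so elements added elsewhere
    # on the page (such as a new banner) do not affect the extraction.
    return [title.text for title in table.findall("./row/title")]


titles = extract_titles(PAGE)
```

Because the titles are addressed relative to the table, inserting another banner above the table would leave the result unchanged.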

The relevant pages are downloaded and parsed 1401 to identify their relevant text section, and the metadata relating to the new attitude-data, such as title, author, or date, is extracted from the collected attitude-data.

The processor 120 may further include a runtime environment which may be further configured for labeling each message/document with relevant metadata.

Then, the processor 120, using an on-line interface, categorizes 1410 relevant texts of the collected attitude-data using supervised approaches.

Next, the processor 120 carries out classical text categorization by content, which involves assigning each message/document a list of topics being discussed in it, based on the identification and analysis of issues discussed in the collected attitude data. In addition, the processor 120 carries out text categorization by sentiment, which involves assigning each message/document its polarity label (positive, negative or neutral).

According to a preferred embodiment of the present invention, the content based categorization of the collected attitude-data may be based on an output generated by a training/testing environment which may be a part of the processor 120, and may be used to form the model for categorizing the attitude expression, i.e. the logic of how to identify titles, topics, age groups, gender, etc., as described in greater detail hereinabove.

Optionally, the processor 120 may utilize one of the text categorization techniques in a range which includes but is not limited to: Feature Selection, Feature Filtering, and Training, as described in greater detail herein below.

In a preferred embodiment, the processor 120 is also configured to carry out text categorization by style technologies. Such technologies may add and categorize vital data about the document author, like his/her age or gender, without having any direct background knowledge about the author.

Categorization by style technologies is based on the idea of analyzing the writing style, the language used by the author, the use of foreign language words, etc. to indirectly learn about the author. Having learned about the author, the attitude data may be categorized according to an age group, gender, etc.

Style text categorization may enrich the queries and analysis the end user can perform on the data. Since this style-derived information is static, it may be generated in a metadata pre-processing stage as well.

According to a preferred embodiment of the present invention, the processor 120 may include a Statistics Generator for generating various statistics relating to the collected attitude-data.

Preferably, the processor 120 includes data mining tools for mining 1412 the collected and processed attitude-data, so as to provide a user with means for carrying out pattern analysis and trend detection 1430 in the attitude data.

According to a preferred embodiment of the present invention, the processor may implement any of the methods described hereinbelow for categorizing the texts of the attitude-data and for further analyzing the attitude-data, say for providing statistics relating to the attitude-data or for mining the attitude-data.

The results of the categorization and data mining steps are output and stored in a data storage (a database or a data warehouse) 1420.

Preferably, the processor 120 may further include a concept analyzer, operable by an analyst/user 1450 for concept analyzing 1431 the attitude data, for finding in the attitude-data relevance-relationship(s) between a phrase, comprising one or several words, and a respective concept, as described in greater detail hereinbelow.

More preferably, the processor 120 may also include a correlation measurer, configured for measuring 1432, in the attitude-data, correlations among phrases having relevance-relationships with a common concept, and for measuring correlation between one or more of these phrases and the common concept, as described in greater detail hereinbelow.

According to a preferred embodiment of the present invention, the processor 120 may further include a quotation extractor, for extracting 1433 from the attitude-data key quotations which are found to be descriptive of a relevance-relationship existing in the attitude-data between a concept and respective phrases (comprising one or more words), as described in greater detail hereinbelow.

According to a preferred embodiment, the processor may further include a clusterer. The clusterer may be operable by a user/analyst 1450 for clustering concepts 1434 relating to the attitude-data, as described in greater detail hereinbelow.

According to a preferred embodiment of the present invention, the outputter 130 provides a user 1440 or an analyst 1450 with various graphical tools for examining, exploring, and analyzing attitude-information, generated by collecting and processing the attitude-data. Optionally, the graphical tools may be provided as a web application 1442, so as to allow the user to examine and explore the attitude data remotely via the web.

Reference is now made to FIG. 15 which shows an exemplary graphic representation of the results of clustering concepts in the attitude data, according to a preferred embodiment of the present invention, as described hereinabove.

With clustering, individual messages are analyzed for a central attitude and then added to a corresponding cluster of attitudes.

In the central part of the screen the user can see the generated clusters as circles 1501. Clusters with more messages/documents are denoted as bigger circles, and the distance between clusters is displayed by their visual layout. Clusters that are in the red region 1503 are clusters of negative attitude, while positive attitude ones are in the green part 1505.

On the left side of the screen, the user can see the topic of each cluster 1507. Clicking on one of the clusters displays to the user a set of relevant message/document citations for that cluster.

Reference is now made to FIG. 16 which shows an exemplary graphic representation of the results of correlation measurement according to a preferred embodiment of the present invention.

The correlation measurer, discussed hereinabove, measures correlations of relevant phrases for a central concept as well as their cross-relationships. An exemplary visualization of results of the measurement is shown in FIG. 16.

In the center is the main concept (“USA”) 1601 surrounded by words indicating anti-American attitude expression in the web. The colors describe the various phrase types that are related to the central term, and their cross relations, according to a provided legend 1605. Optionally, the layout algorithm may be based on an SVD (factor analysis) formula combined with MDS (multidimensional scaling), wherein an n×n matrix is used to measure the distances between each pair among the relevant phrases, n denoting the number of phrases.

Reference is now made to FIG. 17 which shows a first exemplary graphic representation of attitude-data analysis according to a preferred embodiment of the present invention.

According to a preferred embodiment the outputter 130 includes a user friendly graphical front end environment for defining and viewing attitude-information. Preferably, there are two types of front end: a desktop application and a web based client.

For example, the front end environment may provide a user with means for tracking trends, buzz, and sentiment, which are preferably based on the data warehouse 150 capabilities such as multidimensional data analysis.

Users may analyze their company's/product's word of mouth over time according to the different markets and vertical markets. Such analysis may prove very beneficial for the users.

In addition, the user has the ability to compare his company/product to other products or companies in his vertical market or to a benchmark, set according to an industry standard. For example, as illustrated in FIG. 17, the user may investigate the concept of the top ten movies 1701, as depicted in a chart showing the trend among the ten most popular movies on a monthly basis 1703.

Reference is now made to FIG. 18 which shows a second exemplary graphic representation of attitude-data analysis according to a preferred embodiment of the present invention.

Preferably, more advanced capabilities than the ones presented in FIG. 17 are available for the advanced user. For example, as shown in FIG. 18, analysis according to gender 1801, analysis according to age range 1803, selection of chart types 1805, selection of axis data 1807, etc. are further available.

For example, when the user chooses to analyze the sentiment with regard to his product according to gender 1801, with respect to all age groups (combined) 1803, he may be presented with a bar chart 1810 depicting the positive vs. negative vs. neutral attitudes towards his product.

Reference is now made to FIG. 19 which is a flow diagram of an exemplary method for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention.

According to a preferred embodiment, attitude data 1900 relating to a subject which is predetermined by a user, say using the apparatus 1000, is collected 1901 from pre-selected web site(s), say by a collector 110, as described hereinabove.

The pre-selected web sites may include, but are not limited to: chat sites, interactive news groups, discussion groups, forums, blogs and the like where people express their views and feelings. For example, Internet users may express their views regarding a proposed tax reform to be discussed by a government, regarding a new product, etc.

Optionally, the collecting may include any number of web sites.

Next, the collected attitude-data is processed 1903, say by a processor as described hereinabove.

The processing 1903 of the attitude-data may typically include content analysis techniques, data mining, and other data analysis techniques. These techniques may implement any one of a variety of algorithms, which includes but is not limited to: neural networks, rule reduction, decision trees, pattern analysis, text and linguistic analysis techniques, or any relevant algorithm known in the art. Detailed exemplary algorithms, usable for processing of the attitude-data, are provided herein below.

Finally, the processed attitude-data is used for outputting 1905 attitude-information to a user, say by an outputter 130, as described hereinabove.

The outputting 1905 may be carried out utilizing graphical tools for presenting and analyzing attitude-information, as described in greater detail hereinabove.

According to a preferred embodiment of the present invention, the collecting 1901 may include crawling the web sites according to a predefined policy. Preferably, the collecting further includes parsing relevant downloaded pages of the pre-selected web sites, as described in greater detail hereinabove.

Preferably, the crawling is carried out according to a policy defined by a user, say by a collecting policy definer 160, as described hereinabove.

According to a preferred embodiment the processing 1903 is carried out in an initial pre-processing step, where metadata relating to the collected attitude-data is processed in advance.

According to a preferred embodiment, the processing 1903 includes categorizing relevant text of the collected attitude-data using supervised approaches.

Preferably, in addition to classical text categorization by content, which involves assigning each message/document a list of topics being discussed in it, a preferred embodiment may include using text categorization by style technologies. Such technologies may add and categorize vital data about the document author, like his/her age or gender, without having any direct background knowledge about the author.

As described hereinabove, categorization by style technologies is based on the idea of analyzing the writing style, the language used by the author, the use of foreign language words, etc. to indirectly learn about the author.

Style text categorization may enrich the queries and analysis the end user can perform on the data. Since this style-derived information is static, it can be generated in a metadata pre-processing stage.

Reference is now made to FIG. 20 which is a flow diagram of an exemplary method for categorizing attitude-data text according to a preferred embodiment of the present invention.

The general flow of the exemplary categorization process includes: data manipulation 2001, and then feature selection 2003 and feature reduction 2005, applied, as described in greater detail hereinabove, for yielding a feature set/cluster 2010. The example further includes train/test 2015 procedures for forming a model which best represents the attitude-information in the collected attitude-data.

Data Manipulation

Texts cannot be directly interpreted by a classification system. Because of this, an indexing procedure that maps a text into a compact representation of its content is preferably uniformly applied to training, validation, and testing of messages/documents, for successfully carrying out the categorization and mining of the attitude data.

The choice of a representation for text depends on what one regards as the meaningful units of text (the problem of lexical semantics) and the meaningful natural language rules for the combination of these units. Similarly to what happens in IR (Information Retrieval), in TC (Text Categorization) a text may be represented as a vector of pairs of terms and their weights. Each of the document terms (sometimes called features) occurs at least once (in at least one message/document). There are different ways to understand what a term is and different ways to compute term weights.

A typical way of understanding a term is to identify the term with a word. This is often referred to as either the set of words or the bag of words approach to document representation, because a bag or set of words is available from which to select the meaning of the term. With the bag of words approach, a list of words and word combinations is weighted according to the number of appearances of each word or word combination in the document. Predefined stop words/combinations are then excluded from the list, and the term is understood in light of the weights of the remaining words/combinations.
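The bag of words weighting described above may be sketched as follows. The tokenization rule, the small stop word list and the function name `bag_of_words` are assumptions made for this illustration; single words only are counted here.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real list would be predefined per language.
STOP_WORDS = {"the", "a", "i", "is", "and"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Weight each word by its number of appearances, excluding stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


weights = bag_of_words("The movie is great and the cast is great")
```

Here `weights` maps each remaining word to its count, which serves as the term weight in the vector of (term, weight) pairs.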

Feature Selection

Feature selection may relate to various types of features, ranging from textual ones, like words and dictionary based words, to more grammatical features like part-of-speech tags and their combinations. Preferably, feature selection further includes combinations of phrases, represented as N-grams. N-grams are phrases combining a number (n) of words.
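Generating N-gram features from a tokenized message may be sketched as follows. The function name `ngrams` and the representation of a phrase as a tuple of words are choices made for this illustration.

```python
def ngrams(tokens, n):
    """Return all phrases of n consecutive words, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["not", "very", "good"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Phrases such as ("not", "very") keep word order that single-word features lose, which can matter when sentiment depends on a negation.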

Feature Filtering

Unlike in text retrieval, in TC the high dimensionality of the term space may be problematic, as the objective of TC is to extract an attitude from a mass of words rather than to search for a given phrase. In fact, while typical algorithms used in text retrieval can scale up to high numbers of terms, the same does not hold for many sophisticated learning algorithms used for TC, which is about extracting the general attitude rather than its detailed expression.

Preferably, because of this problem, a Feature filter is also implemented. The effect of the filtering is to reduce the size of the term space. The filtering may apply methods for feature reduction that include but are not limited to: dictionary based reduction, term frequency reduction, and information-gain filtering.

With dictionary based reduction, the features are limited to a certain group of words that appear in a predefined dictionary word list (like function words).

Term frequency reduction is based on filtering out features that appear in too many messages/documents, such as “I” and “The”, or in too few messages/documents. That is to say, terms that appear in too many messages are regarded as too general whereas terms that appear in too few messages are regarded as too specific. Information gain filtering measures the decrease in entropy as a result of the presence of a certain term in the text. This is useful for identifying the features that best distinguish between groups in the space of documents/messages.
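Term frequency reduction may be sketched as follows. The function name `filter_by_document_frequency` and the two thresholds `min_df` and `max_df` are hypothetical, since no concrete limits are specified in this description.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms appearing in at least min_df and at most max_df documents.

    docs is a list of token lists, one per message/document.
    """
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))  # count each term once per document
    return {
        term
        for term, count in document_frequency.items()
        if min_df <= count <= max_df
    }


docs = [
    ["i", "like", "dvd"],
    ["i", "hate", "dvd"],
    ["i", "like", "popcorn"],
]
kept = filter_by_document_frequency(docs, min_df=2, max_df=2)
```

In this toy collection "i" appears in every document and is dropped as too general, while "hate" and "popcorn" appear only once and are dropped as too specific.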

For example, information gain may be formally defined as:

$IG(t) = -\sum_{i=1}^{m} P(C_i) \log P(C_i) + P(t) \cdot \left[ \sum_{i=1}^{m} P(C_i \mid t) \log P(C_i \mid t) \right] + P(\bar{t}) \cdot \left[ \sum_{i=1}^{m} P(C_i \mid \bar{t}) \log P(C_i \mid \bar{t}) \right]$

Where:

$C_i$ denotes the i-th of the m categories.

$P(C_i) = \frac{\#\text{docs in category } C_i}{\#\text{docs in all categories}}$

$P(t) = \frac{\#\text{docs where } t \text{ appears}}{\#\text{all docs}}$

$P(C_i \mid t) = \frac{\#\text{docs in } C_i \text{ where } t \text{ appears}}{\#\text{all docs where } t \text{ appears}}$

$P(\bar{t}) = 1 - P(t)$

$P(C_i \mid \bar{t}) = \frac{\#\text{docs in } C_i \text{ where } t \text{ does not appear}}{\#\text{all docs where } t \text{ does not appear}}$
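Information gain, taken as the decrease in category entropy once the presence or absence of a term is known, may be computed as in the following sketch. The function names, the use of base-2 logarithms and the toy data are assumptions made for this illustration.

```python
import math


def entropy(probabilities):
    """Entropy of a distribution in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(docs, term):
    """docs is a list of (category, set_of_terms) pairs."""
    categories = sorted({category for category, _ in docs})

    def distribution(labels):
        if not labels:
            return []
        return [labels.count(c) / len(labels) for c in categories]

    all_labels = [category for category, _ in docs]
    with_term = [category for category, terms in docs if term in terms]
    without_term = [category for category, terms in docs if term not in terms]

    p_t = len(with_term) / len(docs)
    prior = entropy(distribution(all_labels))
    conditional = (p_t * entropy(distribution(with_term))
                   + (1 - p_t) * entropy(distribution(without_term)))
    return prior - conditional


docs = [
    ("sports", {"goal", "the"}),
    ("sports", {"goal", "match", "the"}),
    ("non-sports", {"tax", "the"}),
    ("non-sports", {"movie", "the"}),
]
gain_goal = information_gain(docs, "goal")
gain_the = information_gain(docs, "the")
```

In this toy collection "goal" separates the two categories perfectly and receives the maximal gain of one bit, whereas "the" appears everywhere and receives a gain of zero.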

Train/Test Procedure

Preferably, one or more machine learning algorithms are applied to the data set to find a model which best extracts attitude data from the messages/documents downloaded from the crawled web sites.

For example, given a collection of messages/documents discussing “sports” and “non-sports”, the model learns how to distinguish sport messages/documents from non-sport ones.

In order to do this, several models of text categorization may be applied, including but not limited to: Decision Tree (J48), Naïve Bayes, and SVM.

Decision Tree: a decision tree (DT) for text categorization is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by the weight that the term has in the test document, and leaves are labeled by categories.

Such a tree categorizes a test document by recursively testing the weights that the terms labeling the internal nodes have in a vector, until a leaf node is reached. The label of this node is then assigned to the document. Most such trees use binary document representations, and are thus binary trees.
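Categorizing a test document with a binary decision tree may be sketched as follows. The tree is written by hand for illustration rather than learned from data, and its nested-dictionary layout with the keys `term`, `present` and `absent` is a hypothetical representation.

```python
# Internal nodes are dictionaries labeled by a term; leaves are category labels.
TREE = {
    "term": "goal",
    "present": "sports",
    "absent": {
        "term": "match",
        "present": "sports",
        "absent": "non-sports",
    },
}


def classify(node, document_terms):
    """Follow the branch matching each tested term until a leaf is reached."""
    while isinstance(node, dict):
        branch = "present" if node["term"] in document_terms else "absent"
        node = node[branch]
    return node


label_a = classify(TREE, {"late", "goal"})
label_b = classify(TREE, {"tax", "reform"})
```

With a binary document representation each term is either present or absent, so every internal node has exactly two outgoing branches.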

There are a number of standard packages for DT learning, and most DT approaches to TC have made use of such packages. Among the most popular ones are ID3 (used by Fuhr et al. [1991]), C4.5 (used by Cohen and Hirsh [1998], Cohen and Singer [1999], Joachims [1998], and Lewis and Catlett [1994]), and C5 (used by Li and Jain [1998]).

Naïve Bayes: Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as “data record X belongs to a specified class C.” For classification, we want to determine P(H|X), the probability that the hypothesis H holds, given the observed data record X.

P(H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H.

In this example P(H) is the probability that any given data record is an apple, regardless of how the data record looks. The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is to say, it is the probability that X is red and round given that we know that it is true that X is an apple. P(X) is the prior probability of X, i.e. it is the probability that a data record from our set of fruits is red and round.

Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem may be formally defined by the equation:

P(H|X) = P(X|H) P(H) / P(X).
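Applying Bayes theorem to the apple example may be sketched as follows. The three input probabilities are made-up values chosen only to make the arithmetic easy to follow, and the function name `posterior` is hypothetical.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x


# Made-up illustrative values:
p_h = 0.2          # P(H): a fruit is an apple
p_x_given_h = 0.9  # P(X|H): an apple is red and round
p_x = 0.3          # P(X): a fruit is red and round

p_h_given_x = posterior(p_x_given_h, p_h, p_x)  # 0.9 * 0.2 / 0.3
```

Under these assumed values, observing that a fruit is red and round raises the probability that it is an apple from the prior 0.2 to a posterior of about 0.6.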

SVM: The support vector machine (SVM) method has been introduced in TC by Joachims [1998, 1999] and subsequently used by Drucker et al. [1999], Dumais et al. [1998], Dumais and Chen [2000], Klinkenberg and Joachims [2000], Taira and Haruno [1999], and Yang and Liu [1999].

In geometrical terms, it may be seen as an attempt to find, among all the surfaces $\sigma_1, \sigma_2, \ldots$ in $|T|$-dimensional space that separate the positive from the negative training examples (decision surfaces), the $\sigma_i$ that separates the positives from the negatives by the widest possible margin. That is to say, such that the separation property is invariant with respect to the widest possible translation of $\sigma_i$.

This idea is best understood in a case where the positives and the negatives are linearly separable, in which the decision surfaces are $(|T|-1)$-hyperplanes.

The SVM method chooses the middle element from the “widest” set of parallel lines, that is to say, from the set in which the maximum distance between two elements in the set is highest. It is noteworthy that this “best” decision surface is determined by only a small set of training examples, called the support vectors. The method described is applicable also to a case where the positives and the negatives are not linearly separable.

As argued by Joachims [1998], SVM offers two important advantages for TC. One is that term selection is often not needed, as SVM tends to be resistant to overfitting (that is, to producing a statistical model that is too complex for the amount of data) and can handle large dimensionality. The other is that no human and computer processing effort in parameter tuning on a validation set is needed, as there is a theoretically motivated default choice of parameter settings which has also been shown to provide the best effectiveness.

The above described methods and algorithms are usually implemented in anon-line supervised manner, involving an analyst/user. A preferredembodiment of the present invention further implements unsupervisedapproaches. Preferably the unsupervised approaches facilitate processingrelatively large volumes of textual attitude-data.

A preferred embodiment of the present invention involves unsupervisedapproaches that are based on data mining techniques.

A preferred embodiment of the present invention may utilize a two layersapproach. One layer is an application layer and the other is an openquery layer where the user may define relevant queries.

The application layer may use, but is not limited to using:

Data representation—a data representation component may be used forinternally representing text of the attitude-data.

Memory and performance efficient data-structures are essential forperforming the complex online analysis tasks. The data representationcomponent translates the text to a compact binary representation,enabling faster analysis, for example using following steps.

Frequency analysis—a frequency analyzer may be used to provide the userwith various statistics on different parameters, like: most frequentwords, phrases, number of authors, unique authors, or distribution overtime frame. The frequency analyzer may utilize a counter for countingwords, phrases, etc. The counter provides raw data that is thenprocessed by the frequency analyzer, to generate various statisticsdata.

Concept Analysis: a concept analyzer may be employed for finding the most interesting and relevant phrases relating to a certain concept, in the attitude-data.

The analysis handles single-word phrases as well as relevant multiple-word phrases. The concept analyzer may scan all the words or phrases in the collection, and assign a relevance score to each of them, to indicate relevance of the word or phrase to the researched concept.

Preferably, the relevance is measured by the ratio of the frequency with which the word/phrase co-occurs with a “leading concept/word” (i.e. the concept/word currently being analyzed) to the frequency with which the word/phrase occurs without the “leading concept/word”. The higher this ratio is, the more relevant the word/phrase is.

In order to extract phrases (longer than one word), the analysis may include examining the top K (usually 100) words, and then looking for phrases containing at least one of the top K words. Those phrases whose relevance score (calculated as for single words) is higher than a certain threshold are considered relevant.
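By way of non-limiting illustration only, the relevance ratio and the top-K phrase extraction described above might be sketched as follows. Several choices here are assumptions made for the example rather than features of the specification: co-occurrence is counted per document, candidate phrases are adjacent word pairs, a `smoothing` term is added to the denominator to avoid division by zero, and all function and parameter names are hypothetical.

```python
from collections import Counter

def ratio_scores(doc_items, has_lead, smoothing=1.0):
    """Score each item by (docs with the lead) / (docs without the lead)."""
    with_lead, without_lead = Counter(), Counter()
    for items, flag in zip(doc_items, has_lead):
        (with_lead if flag else without_lead).update(set(items))
    vocabulary = set(with_lead) | set(without_lead)
    # `smoothing` avoids division by zero; it is an assumption of this sketch.
    return {item: with_lead[item] / (without_lead[item] + smoothing)
            for item in vocabulary}

def analyze_concept(docs, lead, k=100, threshold=1.0):
    """docs: list of token lists; lead: the leading concept/word."""
    has_lead = [lead in tokens for tokens in docs]
    words = ratio_scores([[t for t in tokens if t != lead] for tokens in docs],
                         has_lead)
    top_k = set(sorted(words, key=words.get, reverse=True)[:k])
    bigrams = [[" ".join(p) for p in zip(tokens, tokens[1:])] for tokens in docs]
    phrases = ratio_scores(bigrams, has_lead)
    # Keep phrases that contain a top-K word and score above the threshold.
    relevant = {p: s for p, s in phrases.items()
                if s > threshold and top_k & set(p.split())}
    return words, relevant

docs = [
    ["phone", "battery", "life", "short"],
    ["phone", "battery", "life", "good"],
    ["phone", "screen", "bright"],
    ["laptop", "screen", "bright"],
]
words, relevant = analyze_concept(docs, lead="phone")
```

In this toy collection "battery" scores higher than "screen" because "screen" also appears in a document that does not mention the leading word.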

Correlation measurement: according to a preferred embodiment, a correlation measurer may be used to reveal interesting relationships between phrases and concepts in the attitude-data.

When analyzing a concept, one important piece of information is what is mentioned with or related to this concept, and how these items are issue-related. This is done by measuring correlation.

According to a preferred embodiment of the present invention, the relevant phrases that were identified in the concept analysis stage are populated in a matrix where the distances between all pairs of phrases are calculated, as described in greater detail herein below.

Then, the matrix may be populated into a visual interface, with the analyzed concept/phrase in the middle, and the relevant phrases surrounding it, as illustrated in FIG. 14 and discussed hereinabove.

The distance from the central concept measures the relevance to it, and the distances among the other phrases themselves represent their closeness. These metrics are directly derived from the distances in the distance matrix, populated as described below.

Preferably, in order to calculate the distance between two phrases, two parameters are taken into consideration: the significance of the co-occurrence of these phrases and the frequency of this occurrence.

According to a preferred embodiment, the distance between phrases a and b is calculated according to the formula:

$$D(a,b) = \frac{\mathrm{freq}(a,b)^{1.5}}{\mathrm{freq}(a,\tilde{b})}$$

$\mathrm{freq}(a,b)$ = frequency for a to co-appear with b, for some measure of togetherness

$\mathrm{freq}(a,\tilde{b})$ = frequency for a to appear where b does not appear

Note that D(a,b) is not symmetric with D(b,a).

In a preferred embodiment, the distance between the two phrases, as put in the matrix, is the maximum of the two: $DV(a,b) = \max\left(D(a,b),\, D(b,a)\right)$
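By way of non-limiting illustration only, the distance matrix might be populated as sketched below. The sketch assumes that "togetherness" means appearing in the same document, that each document is represented as a set of phrases, and that a zero denominator is replaced by a floor of 1; the floor and all names are assumptions of the example and do not appear in the specification.

```python
from itertools import combinations

def directed_d(docs, a, b, floor=1):
    """D(a, b) = freq(a, b) ** 1.5 / freq(a, not b), counted per document."""
    both = sum(1 for d in docs if a in d and b in d)
    a_without_b = sum(1 for d in docs if a in d and b not in d)
    # `floor` guards against division by zero; it is an assumption of this sketch.
    return both ** 1.5 / max(a_without_b, floor)

def distance_matrix(docs, phrases):
    """Symmetric matrix DV(a, b) = max(D(a, b), D(b, a)), stored as a dict."""
    dv = {}
    for a, b in combinations(phrases, 2):
        value = max(directed_d(docs, a, b), directed_d(docs, b, a))
        dv[(a, b)] = dv[(b, a)] = value
    return dv

docs = [
    {"battery", "life"}, {"battery", "life"},
    {"battery", "life"}, {"battery", "life"},
    {"battery", "screen"}, {"life"}, {"life"},
]
dv = distance_matrix(docs, ["battery", "life", "screen"])
```

Here D("battery", "life") is 4^1.5 / 1 = 8 while D("life", "battery") is 4^1.5 / 2 = 4, which illustrates the asymmetry noted above; the matrix entry is the larger of the two.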

Quotation extraction: a quotation extractor is preferably employed for extracting key quotations from a data file that contains a given list of concepts, in order to provide a user with the relevant text citations best describing a relationship, existing in the attitude-data, between a concept and its neighbor (relevant) phrases.

The challenge in the above case is identifying, ad hoc, the most relevant documents, finding in them the most relevant phrases, and then displaying the phrases to the end user. The relevance in this case is measured by the frequency of the searched phrases in the text, in coordination with their distance in the message/document itself.
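By way of non-limiting illustration only, one possible way of combining frequency and distance when ranking candidate quotations is sketched below. The specification does not give a scoring formula; the formula used here (number of mentions divided by the smallest token gap), the sentence splitter, the restriction to distinct single-word terms and all names are assumptions of the example.

```python
import re

def quote_score(sentence, concept, neighbor):
    """Score a sentence by term frequency and by how close the terms are."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    pos_c = [i for i, t in enumerate(tokens) if t == concept]
    pos_n = [i for i, t in enumerate(tokens) if t == neighbor]
    if not pos_c or not pos_n:
        return 0.0
    gap = min(abs(i - j) for i in pos_c for j in pos_n)
    # Hypothetical scoring: more mentions and a smaller gap both raise the score.
    # Assumes concept and neighbor are distinct single words, so gap >= 1.
    return (len(pos_c) + len(pos_n)) / gap

def key_quotations(text, concept, neighbor, top_n=2):
    """Return the top_n sentences that mention both terms, best first."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = [(quote_score(s, concept, neighbor), s) for s in sentences]
    scored = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked[:top_n]]

text = ("The battery drains fast. I love the screen but the battery is weak. "
        "The screen is sharp. Battery and screen are both fine.")
quotes = key_quotations(text, "battery", "screen")
```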

Clustering: according to a preferred embodiment, the concepts relating to the attitude-data may be clustered, say by a clusterer, as discussed hereinabove.

Clustering may include aimed clustering, which includes clustering the concepts that strongly relate to a given topic. Clustering may also include free clustering, where a given attitude-data set is clustered into distinctive groups whose members strongly relate to one another. This functionality is useful when analyzing new domains about which the analyst has no prior knowledge.

Reference is now made to FIG. 21 which is an exemplary pseudo-code algorithm for clustering concepts relating to attitude-data, according to a preferred embodiment of the present invention.

Free clustering may be implemented using a clustering algorithm as exemplified using FIG. 21, to provide the user with the list of most relevant document clusters in the collection, along with cluster names and a list (and view) of the documents belonging to each cluster.

The algorithm of FIG. 21 has the following advantages over the classical clustering algorithms: no predefined fixed number of clusters as in the classical clustering algorithms, ability to control the words that build the different clusters, and ability to merge and split clusters.

A general, well-known problem of traditional clustering algorithms pertains to the relevance of the generated clusters to the needs of the end-user, and to the fact that the traditional algorithms are based on the end-user's previous knowledge of well-known world facts.

The example algorithm enables the user to control the output and quality of the final clusters, thus overcoming these shortcomings.

According to a preferred embodiment of the present invention, the processing of the attitude-data further includes data mining techniques.

Preferably, the data mining techniques may include, but are not limited to, Pattern analysis and Trend analysis.

With Pattern analysis the processing includes searching for patterns in the statistics that may be provided by a statistics generator as described hereinabove.

The process may reveal relationships that are not obvious or sift out meaningful data from noise, exploiting favorable patterns and avoiding bad ones. Pattern analysis is a traditional part of data mining algorithms as applied on data stored in relational databases. However, in a preferred embodiment, Pattern analysis is further applied to unstructured textual data.

With Trend analysis, the processing further includes detecting emerging trends in the attitude-data, like new emerging products, consumer habits and more.

Optionally, trend analysis may be done by applying linear regression principles on the data set results. Once a list of related phrases is discovered, an analysis of correlation trends over time using linear regression is carried out.

If a strong positive (or negative) correlation trend (indicated by a high absolute value of the correlation derivative) is discovered, it is checked for consistency over time, by measuring the mean squared error.

The phrases that have the strongest trend derivative, and the least error, are regarded as those with the strongest trends, and are displayed to the user along with their trend graph and regression equation.
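By way of non-limiting illustration only, the regression-based trend detection might be sketched as follows. The sketch assumes that each phrase has one correlation (or frequency) value per equally spaced time period, and it interprets the "correlation derivative" as the slope of the fitted line. The thresholds `min_slope` and `max_mse`, and all names, are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; needs >= 2 points."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The mean squared error measures how consistent the trend is over time.
    mse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, values)) / n
    return slope, intercept, mse

def strongest_trends(series, min_slope=0.5, max_mse=1.0):
    """Keep phrases with a steep slope and a small error, steepest first."""
    kept = []
    for phrase, values in series.items():
        slope, intercept, mse = fit_trend(values)
        if abs(slope) >= min_slope and mse <= max_mse:
            kept.append((phrase, slope, intercept, mse))
    return sorted(kept, key=lambda row: (-abs(row[1]), row[3]))

series = {
    "battery": [1, 2, 3, 4, 5],   # steady rise
    "screen": [3, 3, 3, 3, 3],    # flat
    "price": [0, 6, 1, 7, 2],     # rising on average, but erratic
}
trends = strongest_trends(series)
```

The regression equation displayed to the user would then be y = intercept + slope * t for each retained phrase.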

Platform Architecture

Reference is now made to FIG. 22 which is a simplified block diagram of an exemplary architecture of an apparatus for analyzing attitudes expressed in web sites, according to a preferred embodiment of the present invention.

An architecture according to a preferred embodiment may be a distributed environment architecture having loosely coupled components 2221-5, communicating through one central fault-tolerant management and data center 2230.

High availability of the data center 2230 is ensured by running the data center in a computer server cluster with redundant machines.

The central data center 2230 preferably runs on top of a central data storage (data base/data warehouse) 2235, secured with redundant machines ensuring high availability. The data center 2230 stores the current system status and configuration (along with data to be analyzed) as well as the communication messages between the various system components.

Having a message-based communication system enables full distribution of the various run time components, thus providing full scaling capability. This architecture also enables real time configuration changes, immediately affecting all the running components without requiring a restart of the whole system or a long update time.

Preferably, all the components communicate in an asynchronous mode, using messages. All the messages are posted to queues, where they wait for processing by each of the components. Each component owns one input message queue, one output queue and one management (commands) queue. The input queue contains the processing requests waiting to be processed by the component. Upon completion, the processed document is posted to an output queue (which is actually the input queue of the next component in the pipeline).
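By way of non-limiting illustration only, a component owning an input queue, an output queue and a management (commands) queue might be sketched as follows, using in-process threads and Python's standard `queue` module in place of a distributed messaging system. The `Component` class, the `STOP` sentinel and the `(key, value)` command format are assumptions of the example and are not taken from the specification.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel telling a component to shut down

class Component(threading.Thread):
    def __init__(self, name, work, inbox, outbox):
        super().__init__(name=name, daemon=True)
        self.work = work                # function(document, config) -> document
        self.inbox = inbox              # input message queue
        self.outbox = outbox            # output queue (input of the next component)
        self.commands = queue.Queue()   # management (commands) queue
        self.config = {}

    def _apply_commands(self):
        # Drain pending (key, value) commands without blocking, so that
        # configuration changes take effect without a restart.
        while True:
            try:
                key, value = self.commands.get_nowait()
            except queue.Empty:
                return
            self.config[key] = value

    def run(self):
        while True:
            doc = self.inbox.get()      # wait for the next processing request
            if doc is STOP:
                self.outbox.put(STOP)   # propagate shutdown down the pipeline
                return
            self._apply_commands()
            self.outbox.put(self.work(doc, self.config))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
tokenizer = Component(
    "tokenizer",
    lambda doc, cfg: doc.lower().split() if cfg.get("lowercase") else doc.split(),
    q1, q2)
counter = Component("counter", lambda tokens, cfg: len(tokens), q2, q3)
tokenizer.commands.put(("lowercase", True))
for component in (tokenizer, counter):
    component.start()
for doc in ["Great battery life", "Poor screen"]:
    q1.put(doc)
q1.put(STOP)

results = []
while True:
    item = q3.get()
    if item is STOP:
        break
    results.append(item)
```

Because the output queue of one component is the input queue of the next, components can be added or moved to other machines without changing their processing logic, provided the queues are replaced by a networked equivalent.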

An apparatus according to a preferred embodiment of the present invention may provide means for proper storage for any volume of data with fast access capabilities.

It is expected that during the life of this patent many relevant devices and systems will be developed and the scope of the terms herein, particularly of the terms “Collector”, “Processor”, “Outputter”, “Database” and “data Warehouse”, is intended to include all such new technologies a priori.

Additional objects, advantages, and novel features of the present invention will become apparent to one ordinarily skilled in the art upon examination of the following examples, which are not intended to be limiting. Additionally, each of the various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below finds experimental support in the following examples.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.

Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.

1. Apparatus for providing an analysis of attitudes expressed in web sites, comprising: a collector for collecting attitude-data in relation to a predetermined subject from at least one pre-selected web site, said attitude-data containing attitudes in relation to said predetermined subject; a processor, associated with said collector, for processing said attitude data so as to generate an attitude analysis; and an outputter, associated with said processor, for outputting said attitude analysis, thereby to provide an indication of attitudes being expressed in said web content in relation to said predetermined subject.
 2. The apparatus of claim 1, wherein said collector further comprises at least one crawler, configured for crawling said at least one respective pre-selected web site for the attitude data in relation to said predetermined subject.
 3. The apparatus of claim 2, wherein said crawler is further configured to download relevant pages of the at least one pre-selected web site according to a predetermined schedule.
 4. The apparatus of claim 1, wherein said collector is further configured to extract relevant data from said relevant pages of the pre-selected web site.
 5. The apparatus of claim 1, wherein said collector is further configured to create a mark-up language format representation of said relevant data.
 6. The apparatus of claim 5, wherein said mark-up language is XML.
 7. The apparatus of claim 1, wherein said collector further comprises a data integrator, configured for integrating the attitude data into a complete and non-redundant attitude data.
 8. The apparatus of claim 1, further comprising a collecting policy definer, associated with the collector, operable for defining specific guidelines for collecting with respect to one of said at least one pre-selected web site.
 9. The apparatus of claim 1, wherein said processor is further configured for carrying out an initial preprocessing step for extracting metadata of said attitude-data.
 10. The apparatus of claim 1, wherein said processor is further operable for categorizing text of said attitude-data.
 11. The apparatus of claim 10, wherein said categorization comprises categorization of said attitude data according to style.
 12. The apparatus of claim 10, wherein said categorization comprises categorization of said attitude data according to content.
 13. The apparatus of claim 1, wherein said processor further comprises a statistics generator, configured for generating statistics based on said processing attitude-data.
 14. The apparatus of claim 1, wherein said processor further comprises a concept analyzer, configured for analyzing said attitude-data and for finding in said attitude-data at least one relevance-relationship between a phrase and a respective concept relating to said subject.
 15. The apparatus of claim 14, wherein said concept is an attitude.
 16. The apparatus of claim 1, wherein said processor further comprises a correlation measurer, configured for measuring in said attitude-data correlations among phrases having a relevance-relationship with a common concept and between at least one of said phrases and said common concept, said concept relating to said subject.
 22-25. (canceled)
 26. The apparatus of claim 1, wherein said outputter is implemented using a web browser based application.
 27-43. (canceled)